Final ¶

Context¶

  • Why is this problem important to solve?

    The solution to this problem should better the ROI for new investments. Improve the knoledge about the customer, therefore, the organization are able to create customized strategy according to customer cluster.

The objectives:¶

  • What is the intended goal?

    Define customer clusters according to delivered data. Then, we have to find out the key features to group the customers

The key questions:¶

  • What are the key questions that need to be answered?

What are the characteristics that should be common in a cluster customer?

The problem formulation:¶

  • What is it that we are trying to solve using data science?

Find out the customer clusters


Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruits products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Important Notes¶

  • This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook will give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.

  • In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

  • The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

  • All the outputs in the notebook are just for reference and can be different if you follow a different approach.

  • There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they wish to explore different techniques.

Milestone 1¶

Loading Libraries¶

In [131]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist

# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer

# Importing PCA
from sklearn.decomposition import PCA

# To encode the variable
from sklearn.preprocessing import LabelEncoder

# Importing TSNE
from sklearn.manifold import TSNE

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To compute distances
from scipy.spatial.distance import pdist

# To import K-Medoids
from sklearn_extra.cluster import KMedoids

from sklearn.cluster import DBSCAN

# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

# Library to extract datetime features
from datetime import datetime
from datetime import timedelta

Let us load the data¶

In [132]:
# constants. 

year_today = datetime.today().year
In [133]:
# loading the dataset
data_original = pd.read_csv("F:/DataScienceAI/capstone project/marketing_campaign.csv")

data = data_original.copy()

Check the shape of the data¶

In [134]:
# Print the shape of the data
data_shape = data.shape
print(data_shape)
(2240, 27)

**Observations and Insights:¶

We have 2240 rows and 27 columns **

Understand the data by observing a few rows¶

In [135]:
# View first 5 rows
data.head()
Out[135]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [136]:
# View last 5 rows Hint: Use tail() method
data.tail()
Out[136]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 ... 3 4 5 0 0 0 0 0 0 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 ... 2 5 7 0 0 0 1 0 0 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 ... 3 13 6 0 1 0 0 0 0 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 ... 5 10 3 0 0 0 0 0 0 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 ... 1 4 7 0 0 0 0 0 0 1

5 rows × 27 columns

**Observations and Insights:¶

Non-numeric variables: Education, Marital_status Dt_customer is a date in dd-mm-yyyy format **

Let us check the data types and and missing values of each column¶

In [137]:
# Check the datatypes of each column. Hint: Use info() method
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [138]:
data.describe().T
Out[138]:
count mean std min 25% 50% 75% max
ID 2240.0 5592.159821 3246.662198 0.0 2828.25 5458.5 8427.75 11191.0
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0
In [139]:
# We have null data in Income column
In [140]:
# Find the percentage of missing values in each column of the data
data['Income'].count()
percentage_missing = 100 - (data['Income'].count() / 2240 *100)
print(percentage_missing)
1.0714285714285694
In [141]:
# replacing missing values 

data['Income'] = np.where(data['Income'].isnull() , data['Income'].mean(), data['Income'])

**Observations and Insights:¶

data['Income'] has missing values 1.1% of values are no present. The null values were replaced by the mean of the column

we have 27 columns. 3 Object, 1 float, 23 Int

Year_Birth: Min value 1893, inconsistent data. Format : %Y

Dt_Customer ; Object, dd-mm-yyyy

**

We can observe that ID has no null values. Also the number of unique values are equal to the number of observations. So, ID looks like an index for the data entry and such a column would not be useful in providing any predictive power for our analysis. Hence, it can be dropped.

Dropping the ID column

In [142]:
# Remove ID column from data. Hint: Use inplace = True
data.drop(['ID'], axis =1, inplace = True)
In [143]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2240 non-null   int64  
 1   Education            2240 non-null   object 
 2   Marital_Status       2240 non-null   object 
 3   Income               2240 non-null   float64
 4   Kidhome              2240 non-null   int64  
 5   Teenhome             2240 non-null   int64  
 6   Dt_Customer          2240 non-null   object 
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp3         2240 non-null   int64  
 20  AcceptedCmp4         2240 non-null   int64  
 21  AcceptedCmp5         2240 non-null   int64  
 22  AcceptedCmp1         2240 non-null   int64  
 23  AcceptedCmp2         2240 non-null   int64  
 24  Complain             2240 non-null   int64  
 25  Response             2240 non-null   int64  
dtypes: float64(1), int64(22), object(3)
memory usage: 455.1+ KB

Exploratory Data Analysis¶

Let us now explore the summary statistics of numerical variables¶

In [144]:
# Explore basic summary statistics of numeric variables. Hint: Use describe() method.

data.describe().T
Out[144]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2240.0 52247.251354 25037.797168 1730.0 35538.75 51741.5 68289.75 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0

**Observations and Insights:¶

Income: Std is high then probably we have to review outliers ( Max number for instance ) 75% of values are less than 68289.75

Kidhome & Teenhome: Homogeneous and categorical data (0,1 or 2)

**

Let us also explore the summary statistics of all categorical variables and the number of unique observations in each category¶

In [145]:
# List of the categorical columns in the data
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome", "Complain"]

Number of unique observations in each category

In [146]:
for column in cols:
    print("Unique values in", column, "are :")
    print(data[column].unique())
    print("*" * 50)
Unique values in Education are :
['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']
**************************************************
Unique values in Marital_Status are :
['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']
**************************************************
Unique values in Kidhome are :
[0 1 2]
**************************************************
Unique values in Teenhome are :
[0 1 2]
**************************************************
Unique values in Complain are :
[0 1]
**************************************************

Think About It:

  • We could observe from the summary statistics of categorical variables that the Education variable has 5 categories. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  • Similarly, there are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?

Let us replace the "2n Cycle" category with "Master" in Education and "Alone", "Absurd, and "YOLO" with "Single" in Marital_Status¶

In [147]:
# Replace the category "2n Cycle" with the category "Master" and Master as PhD. Studies after the bachelor.
data['Education'] = np.where(data['Education'] == '2n Cycle', 'Master', data['Education'])  
data['Education'] = np.where(data['Education'] == 'PhD', 'Master', data['Education'])
# replacing strings by numbers:
# Basic = 0
# Graduation = 1
# Master = 2
data['Education'] = np.where(data['Education'] == 'Basic', 0, data['Education'])  # Hint: Use the replace() method and inplace=True
data['Education'] = np.where(data['Education'] == 'Graduation', 1, data['Education'])  # Hint: Use the replace() method and inplace=True
data['Education'] = np.where(data['Education'] == 'Master', 2, data['Education'])  # Hint: Use the replace() method and inplace=True

data['Education'] = data['Education'].astype('int64')
In [148]:
# Replace the categories "Alone", "Abusrd", "YOLO" with the category "Single"
# Together and Married status could be grouped in the same category.
# Single = 0 + widow + divorced.
# Relationship = 1  
# 
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Alone', 0, data['Marital_Status'])  
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Absurd', 0, data['Marital_Status']) 
data["Marital_Status"] = np.where( data['Marital_Status'] == 'YOLO', 0, data['Marital_Status'])  
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Together', 1, data['Marital_Status'])  
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Single', 0, data['Marital_Status'])  
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Married', 1, data['Marital_Status']) 
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Widow', 0, data['Marital_Status'])  
data["Marital_Status"] = np.where( data['Marital_Status'] == 'Divorced', 0, data['Marital_Status'])  


data['Marital_Status'] = data['Marital_Status'].astype('int64')
In [149]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2240 non-null   int64  
 1   Education            2240 non-null   int64  
 2   Marital_Status       2240 non-null   int64  
 3   Income               2240 non-null   float64
 4   Kidhome              2240 non-null   int64  
 5   Teenhome             2240 non-null   int64  
 6   Dt_Customer          2240 non-null   object 
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp3         2240 non-null   int64  
 20  AcceptedCmp4         2240 non-null   int64  
 21  AcceptedCmp5         2240 non-null   int64  
 22  AcceptedCmp1         2240 non-null   int64  
 23  AcceptedCmp2         2240 non-null   int64  
 24  Complain             2240 non-null   int64  
 25  Response             2240 non-null   int64  
dtypes: float64(1), int64(24), object(1)
memory usage: 455.1+ KB

Univariate Analysis¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

1. Univariate Analysis - Numerical Data¶

Histograms help to visualize and describe numerical data. We can also use other plots like box plot to analyze the numerical columns.

Let us plot histogram for the feature 'Income' to understand the distribution and outliers, if any.¶

In [150]:
# Create histogram for all feature

for col in data.columns :
    print(col)
    plt.figure(figsize=(20, 7))
    sns.histplot(x=data[col], data = data )
    plt.show()
Year_Birth
Education
Marital_Status
Income
Kidhome
Teenhome
Dt_Customer
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response

**Observations:

Year_Birth : Clearly, we have inconsistent data. This column will be transformed to the age. Education: Already normalized. Marital_Status : Already normalized. Income : Outliers to review and remove.

The following features: MntWines,MntFruits,,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds are the same unit and the same behaivor, we can summarized.

**

In [ ]:
 
In [151]:
# Plot the boxplot

for col in data.columns:
    print(col)
    try:
        sns.boxplot(data=data, x=data[col], showmeans=True, color="green")
        plt.show()
    except:
        print ('non numeric feature')
Year_Birth
Education
Marital_Status
Income
Kidhome
Teenhome
Dt_Customer
non numeric feature
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response

**Observations and Insights:¶

Year_Birth : outliers to remove, inconsistent data. Income : There is outliers in this column Income, this value is over 600,000 as we said in the info() method, where we can see the max values is far away from the mean of the Income

MntWines : Outliers some registers above 1200, but some bottles could be expensive.

Education:

**

Think About It

  • The histogram and the box plot are showing some extreme value on the right side of the distribution of the 'Income' feature. Can we consider them as outliers and remove or should we analyze these extreme values?
In [152]:
# Calculating the upper whisker for the Income variable

Q1 = data.quantile(q=0.25)                          # Finding the first quartile

Q3 = data.quantile(q=0.75)                          # Finding the third quartile

IQR = Q3-Q1                                       # Finding the Inter Quartile Range

upper_whisker = (Q3 + 1.5 * IQR)['Income']          # Calculating the Upper Whisker for the Income variable

print(upper_whisker)                                # Printing Upper Whisker
117416.25
In [153]:
# Let's check the observations with extreme value for the Income variable
index = data.index
ids = index[data.Income > upper_whisker]
print(ids)
print (len(ids))
Int64Index([164, 617, 655, 687, 1300, 1653, 2132, 2233], dtype='int64')
8

Observations and Insights: 95% of values are less than 83926, we can avoid the values over this¶

In [154]:
# Dropping observations identified as outliers 
# Pass the indices of the observations (separated by a comma) to drop them
data.drop(ids, inplace=True) 
In [155]:
# Let's check the observations with extreme value for the Income variable
# we have 3 records with Year_Birth less than 1940, in fact they are: 1900,1893,1899. This data is inconsistent.
# Removing this outliers.
index = data.index
ids = index[data.Year_Birth < 1940 ]
data.drop(ids,inplace=True)
In [156]:
data['Age'] = year_today - data['Year_Birth']
data['kids'] = data['Kidhome'] + data['Teenhome']
In [157]:
data = data.drop(['Year_Birth'], axis =1)
In [158]:
# In the same way as Age and Year_Birth, we can calculate the Engaged_in_days in account de DT_Customer.

data["ageing_customer"] = datetime.today() -  pd.to_datetime(data['Dt_Customer'], format="%d-%m-%Y")

data["ageing_customer"] = data["ageing_customer"].dt.days

# now we can remove Dt_Customer.

data = data.drop(['Dt_Customer'], axis =1)
In [159]:
data.head()
Out[159]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer
0 1 0 58138.0 0 0 58 635 88 546 172 ... 0 0 0 0 0 0 1 65 0 3684
1 1 0 46344.0 1 1 38 11 1 6 2 ... 0 0 0 0 0 0 0 68 2 3134
2 1 1 71613.0 0 0 26 426 49 127 111 ... 0 0 0 0 0 0 0 57 0 3333
3 1 1 26646.0 1 0 26 11 4 20 10 ... 0 0 0 0 0 0 0 38 1 3160
4 2 1 58293.0 1 0 94 173 43 118 46 ... 0 0 0 0 0 0 0 41 1 3182

5 rows × 27 columns

Now, let's check the distribution of the Income variable after dropping outliers.

In [160]:
# Plot histogram by Columns

for col in data.columns :
    print(col)
    sns.histplot(data=data, x=data[col], color="violet")
    plt.show()
Education
Marital_Status
Income
Kidhome
Teenhome
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response
Age
kids
ageing_customer
In [161]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2229 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Education            2229 non-null   int64  
 1   Marital_Status       2229 non-null   int64  
 2   Income               2229 non-null   float64
 3   Kidhome              2229 non-null   int64  
 4   Teenhome             2229 non-null   int64  
 5   Recency              2229 non-null   int64  
 6   MntWines             2229 non-null   int64  
 7   MntFruits            2229 non-null   int64  
 8   MntMeatProducts      2229 non-null   int64  
 9   MntFishProducts      2229 non-null   int64  
 10  MntSweetProducts     2229 non-null   int64  
 11  MntGoldProds         2229 non-null   int64  
 12  NumDealsPurchases    2229 non-null   int64  
 13  NumWebPurchases      2229 non-null   int64  
 14  NumCatalogPurchases  2229 non-null   int64  
 15  NumStorePurchases    2229 non-null   int64  
 16  NumWebVisitsMonth    2229 non-null   int64  
 17  AcceptedCmp3         2229 non-null   int64  
 18  AcceptedCmp4         2229 non-null   int64  
 19  AcceptedCmp5         2229 non-null   int64  
 20  AcceptedCmp1         2229 non-null   int64  
 21  AcceptedCmp2         2229 non-null   int64  
 22  Complain             2229 non-null   int64  
 23  Response             2229 non-null   int64  
 24  Age                  2229 non-null   int64  
 25  kids                 2229 non-null   int64  
 26  ageing_customer      2229 non-null   int64  
dtypes: float64(1), int64(26)
memory usage: 552.1 KB
In [162]:
# Plot the histogram for all columns
for col in data.columns:
    print(col)
    sns.boxplot(data=data, x=data[col], showmeans=True, color="red")
    plt.show()
Education
Marital_Status
Income
Kidhome
Teenhome
Recency
MntWines
MntFruits
MntMeatProducts
MntFishProducts
MntSweetProducts
MntGoldProds
NumDealsPurchases
NumWebPurchases
NumCatalogPurchases
NumStorePurchases
NumWebVisitsMonth
AcceptedCmp3
AcceptedCmp4
AcceptedCmp5
AcceptedCmp1
AcceptedCmp2
Complain
Response
Age
kids
ageing_customer

2. Univariate analysis - Categorical Data¶

Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.

In [163]:
def perc_on_bar(z):
    '''
    plot
    feature: categorical feature
    the function won't work if a column is passed in hue parameter
    '''

    total = len(data[z])                                          # Length of the column
    plt.figure(figsize=(15,5))
    ax = sns.countplot(data[z],palette='Paired',order = data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05                  # Width of the plot
        y = p.get_y() + p.get_height()                            # Height of the plot
        
        ax.annotate(percentage, (x, y), size = 12)                # Annotate the percentage 
    plt.show()                                                    # Show the plot

Let us plot barplot for the variable Marital_Status.¶

Bivariate Analysis¶

We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.

Correlation Heat map¶

Heat map can show a 2D correlation matrix between numerical features.

In [164]:
corr_matrix =data.iloc[ : ,0 : 26].corr()

corr_matrix.T
Out[164]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids
Education 1.000000 0.014989 0.119401 -0.031041 0.088048 -0.028680 0.133817 -0.070698 0.006975 -0.061636 ... -0.041767 0.000293 0.035424 0.011249 -0.008086 0.000209 -0.025773 0.066127 0.123008 0.041548
Marital_Status 0.014989 1.000000 -0.012374 0.023602 0.031609 -0.002807 -0.006189 -0.025400 -0.028011 -0.017265 ... 0.006227 -0.018034 -0.005650 0.016727 0.008502 0.002623 0.001146 -0.150114 0.001459 0.039804
Income 0.119401 -0.012374 1.000000 -0.527213 0.042271 0.006777 0.728021 0.535246 0.693754 0.549958 ... -0.644450 -0.011268 0.231291 0.414982 0.342897 0.104420 -0.027517 0.174419 0.211486 -0.347018
Kidhome -0.031041 0.023602 -0.527213 1.000000 -0.036461 0.009144 -0.498230 -0.373563 -0.443248 -0.388645 ... 0.448826 0.014500 -0.162170 -0.205456 -0.172910 -0.080084 0.036294 -0.080514 -0.234469 0.689881
Teenhome 0.088048 0.031609 0.042271 -0.036461 1.000000 0.016246 0.003899 -0.177128 -0.267046 -0.205301 ... 0.133056 -0.043369 0.038374 -0.191018 -0.140964 -0.012465 0.007442 -0.155726 0.364148 0.698287
Recency -0.028680 -0.002807 0.006777 0.009144 0.016246 1.000000 0.016797 -0.003376 0.026651 0.001694 ... -0.020849 -0.033067 0.018916 0.000942 -0.019317 -0.007946 0.005404 -0.199078 0.017956 0.018318
MntWines 0.133817 -0.006189 0.728021 -0.498230 0.003899 0.016797 1.000000 0.387591 0.586156 0.398024 ... -0.327949 0.061456 0.373206 0.471790 0.354069 0.211109 -0.036257 0.246534 0.162628 -0.354055
MntFruits -0.070698 -0.025400 0.535246 -0.373563 -0.177128 -0.003376 0.387591 1.000000 0.562744 0.593941 ... -0.424034 0.014442 0.009850 0.211630 0.195001 -0.010268 -0.002988 0.125192 0.013360 -0.395893
MntMeatProducts 0.006975 -0.028011 0.693754 -0.443248 -0.267046 0.026651 0.586156 0.562744 1.000000 0.589813 ... -0.543223 0.020987 0.108349 0.385770 0.321606 0.034262 -0.020647 0.247103 0.038417 -0.510945
MntFishProducts -0.061636 -0.017265 0.549958 -0.388645 -0.205301 0.001694 0.398024 0.593941 0.589813 1.000000 ... -0.452835 -0.000243 0.016257 0.197715 0.260545 0.005094 -0.019039 0.110609 0.042064 -0.427105
MntSweetProducts -0.063742 -0.019407 0.549835 -0.371720 -0.163651 0.023071 0.384885 0.566545 0.544347 0.578983 ... -0.430205 0.000862 0.027998 0.258439 0.241486 0.009801 -0.020632 0.116545 0.018938 -0.384810
MntGoldProds -0.085734 -0.029930 0.412596 -0.351059 -0.021478 0.017559 0.384979 0.389145 0.366044 0.421147 ... -0.255997 0.123107 0.021862 0.175822 0.166619 0.049923 -0.030170 0.139726 0.056934 -0.267020
NumDealsPurchases 0.028340 0.024065 -0.131051 0.231266 0.395713 -0.002780 0.013908 -0.133856 -0.165686 -0.141021 ... 0.364861 -0.022863 0.016711 -0.185887 -0.125270 -0.034008 0.004024 0.003424 0.074089 0.452311
NumWebPurchases 0.059840 0.001126 0.493599 -0.365123 0.154624 -0.010256 0.540786 0.295507 0.315014 0.291781 ... -0.065861 0.040796 0.154940 0.138128 0.154292 0.034383 -0.013718 0.147032 0.153410 -0.149525
NumCatalogPurchases 0.045244 -0.017153 0.706752 -0.517305 -0.111976 0.030805 0.673390 0.515184 0.703668 0.564718 ... -0.528833 0.112332 0.148547 0.339294 0.325324 0.106978 -0.018265 0.235325 0.140002 -0.451658
NumStorePurchases 0.072788 0.007311 0.683315 -0.504817 0.048310 0.001671 0.641490 0.462344 0.509984 0.459031 ... -0.445140 -0.069873 0.178389 0.215589 0.182424 0.084811 -0.012108 0.036697 0.138852 -0.326601
NumWebVisitsMonth -0.041767 0.006227 -0.644450 0.448826 0.133056 -0.020849 -0.327949 -0.424034 -0.543223 -0.452835 ... 1.000000 0.059610 -0.034455 -0.279959 -0.195869 -0.004415 0.020326 -0.007097 -0.119885 0.417878
AcceptedCmp3 0.000293 -0.018034 -0.011268 0.014500 -0.043369 -0.033067 0.061456 0.014442 0.020987 -0.000243 ... 0.059610 1.000000 -0.079936 0.080672 0.094422 0.058992 0.009822 0.253838 -0.061412 -0.021031
AcceptedCmp4 0.035424 -0.005650 0.231291 -0.162170 0.038374 0.018916 0.373206 0.009850 0.108349 0.016257 ... -0.034455 -0.079936 1.000000 0.307615 0.251025 0.298167 -0.027079 0.176544 0.064121 -0.088363
AcceptedCmp5 0.011249 0.016727 0.414982 -0.205456 -0.191018 0.000942 0.471790 0.211630 0.385770 0.197715 ... -0.279959 0.080672 0.307615 1.000000 0.404459 0.211859 -0.008312 0.327911 -0.015654 -0.285541
AcceptedCmp1 -0.008086 0.008502 0.342897 -0.172910 -0.140964 -0.019317 0.354069 0.195001 0.321606 0.260545 ... -0.195869 0.094422 0.251025 0.404459 1.000000 0.163089 -0.025006 0.293617 0.007949 -0.225970
AcceptedCmp2 0.000209 0.002623 0.104420 -0.080084 -0.012465 -0.007946 0.211109 -0.010268 0.034262 0.005094 ... -0.004415 0.058992 0.298167 0.211859 0.163089 1.000000 -0.010925 0.162553 0.014806 -0.066393
Complain -0.025773 0.001146 -0.027517 0.036294 0.007442 0.005404 -0.036257 -0.002988 -0.020647 -0.019039 ... 0.020326 0.009822 -0.027079 -0.008312 -0.025006 -0.010925 1.000000 0.000042 0.004380 0.031388
Response 0.066127 -0.150114 0.174419 -0.080514 -0.155726 -0.199078 0.246534 0.125192 0.247103 0.110609 ... -0.007097 0.253838 0.176544 0.327911 0.293617 0.162553 0.000042 1.000000 -0.018918 -0.170480
Age 0.123008 0.001459 0.211486 -0.234469 0.364148 0.017956 0.162628 0.013360 0.038417 0.042064 ... -0.119885 -0.061412 0.064121 -0.015654 0.007949 0.014806 0.004380 -0.018918 1.000000 0.095842
kids 0.041548 0.039804 -0.347018 0.689881 0.698287 0.018318 -0.354055 -0.395893 -0.510945 -0.427105 ... 0.417878 -0.021031 -0.088363 -0.285541 -0.225970 -0.066393 0.031388 -0.170480 0.095842 1.000000

26 rows × 26 columns

In [165]:
plt.figure(figsize=(15, 7))                                                        # Setting the plot size
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # Plotting the correlation plot
plt.show()

Observations and Insights: _¶

We have positive relations between Income and MntWines, NumCatalogPurchases and MntMeatProducts, NumCatalogPurchases and Income/

There is negative relations between: NumWebVisitsMonth and Income

The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.

Education Vs Income¶

In [166]:
print(sns.barplot(x=data['Education'], y=data['Income'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
In [247]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 1 to 2239
Data columns (total 32 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Education            1680 non-null   int64  
 1   Marital_Status       1680 non-null   int64  
 2   Income               1680 non-null   float64
 3   Kidhome              1680 non-null   int64  
 4   Teenhome             1680 non-null   int64  
 5   Recency              1680 non-null   int64  
 6   MntWines             1680 non-null   int64  
 7   MntFruits            1680 non-null   int64  
 8   MntMeatProducts      1680 non-null   int64  
 9   MntFishProducts      1680 non-null   int64  
 10  MntSweetProducts     1680 non-null   int64  
 11  MntGoldProds         1680 non-null   int64  
 12  NumDealsPurchases    1680 non-null   int64  
 13  NumWebPurchases      1680 non-null   int64  
 14  NumCatalogPurchases  1680 non-null   int64  
 15  NumStorePurchases    1680 non-null   int64  
 16  NumWebVisitsMonth    1680 non-null   int64  
 17  AcceptedCmp3         1680 non-null   int64  
 18  AcceptedCmp4         1680 non-null   int64  
 19  AcceptedCmp5         1680 non-null   int64  
 20  AcceptedCmp1         1680 non-null   int64  
 21  AcceptedCmp2         1680 non-null   int64  
 22  Complain             1680 non-null   int64  
 23  Response             1680 non-null   int64  
 24  Age                  1680 non-null   int64  
 25  kids                 1680 non-null   int64  
 26  ageing_customer      1680 non-null   int64  
 27  Family_Size          1680 non-null   int64  
 28  Expenses             1680 non-null   int64  
 29  NumTotalPurchases    1680 non-null   int64  
 30  TotalAcceptedCmp     1680 non-null   int64  
 31  AmountPerPurchase    1680 non-null   int64  
dtypes: float64(1), int64(31)
memory usage: 497.7 KB
In [246]:
print(sns.barplot(x=data['Education'], y=data['Expenses'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)

**Observations and Insights:¶

0 : Basic Education has the lowest income. Graduation and Master could be joined Las categorias Graduation y Master tienen income similares, en terminos generales, podriamos simplificar.

In [167]:
data['Education'] = np.where(data['Education'] == '2', '1', data['Education'])  
data['Education'] = data['Education'].astype('int64')

Marital Status Vs Income¶

In [168]:
# Plot the bar plot for Marital_Status and Income
print(sns.barplot(x=data['Marital_Status'], y=data['Income'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)

**Observations and Insights:¶

No differences between Marital_Status and the income.

**

Kidhome Vs Income¶

In [249]:
# Plot the bar plot for Kidhome and Income
print(sns.barplot(x=data['kids'], y=data['Income'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
In [250]:
# Plot the bar plot for Kidhome and Income
print(sns.barplot(x=data['kids'], y=data['Expenses'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)

Observations and Insights: No significant differences between kids variable¶

We can also visualize the relationship between two categorical variables.

Marital_Status Vs Kidhome¶

In [170]:
# Plot the bar plot for Marital_Status and Kidhome
pd.crosstab(data['Marital_Status'],data['kids']).plot(kind='hist', stacked=False)
Out[170]:
<AxesSubplot:ylabel='Frequency'>

Family Size¶

  • Let's create a new variable called 'Family Size' to find out how many members each family has.
  • For this, we need to have a look at the Marital_Status variable, and see what are the categories.
  • We can combine the sub-categories Single, Divorced, Widow as "Single" and we can combine the sub-categories Married and Together as "Relationship"
  • Then we can create a new variable called "Status" and assign values 1 and 2 to categories Single and Relationship, respectively.
  • Then, we can use the Kids (calculated above) and the Status column to find the family size.
In [171]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2229 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Education            2229 non-null   int64  
 1   Marital_Status       2229 non-null   int64  
 2   Income               2229 non-null   float64
 3   Kidhome              2229 non-null   int64  
 4   Teenhome             2229 non-null   int64  
 5   Recency              2229 non-null   int64  
 6   MntWines             2229 non-null   int64  
 7   MntFruits            2229 non-null   int64  
 8   MntMeatProducts      2229 non-null   int64  
 9   MntFishProducts      2229 non-null   int64  
 10  MntSweetProducts     2229 non-null   int64  
 11  MntGoldProds         2229 non-null   int64  
 12  NumDealsPurchases    2229 non-null   int64  
 13  NumWebPurchases      2229 non-null   int64  
 14  NumCatalogPurchases  2229 non-null   int64  
 15  NumStorePurchases    2229 non-null   int64  
 16  NumWebVisitsMonth    2229 non-null   int64  
 17  AcceptedCmp3         2229 non-null   int64  
 18  AcceptedCmp4         2229 non-null   int64  
 19  AcceptedCmp5         2229 non-null   int64  
 20  AcceptedCmp1         2229 non-null   int64  
 21  AcceptedCmp2         2229 non-null   int64  
 22  Complain             2229 non-null   int64  
 23  Response             2229 non-null   int64  
 24  Age                  2229 non-null   int64  
 25  kids                 2229 non-null   int64  
 26  ageing_customer      2229 non-null   int64  
dtypes: float64(1), int64(26)
memory usage: 552.1 KB
In [172]:
# Add two variables Status and Kids to get the total number of persons in each family
data["Family_Size"] = data['Marital_Status'] + data['kids'] + 1
In [173]:
data.head()
Out[173]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size
0 1 0 58138.0 0 0 58 635 88 546 172 ... 0 0 0 0 0 1 65 0 3684 1
1 1 0 46344.0 1 1 38 11 1 6 2 ... 0 0 0 0 0 0 68 2 3134 3
2 1 1 71613.0 0 0 26 426 49 127 111 ... 0 0 0 0 0 0 57 0 3333 2
3 1 1 26646.0 1 0 26 11 4 20 10 ... 0 0 0 0 0 0 38 1 3160 3
4 2 1 58293.0 1 0 94 173 43 118 46 ... 0 0 0 0 0 0 41 1 3182 3

5 rows × 28 columns

Expenses¶

  • Let's create a new feature called "Expenses", indicating the total amount spent by the customers in various products over the span of two years.
In [174]:
# Create a new feature
# Add the amount spent on each of product 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
data["Expenses"] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']
In [175]:
data.tail()
Out[175]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size Expenses
2235 1 1 61223.0 0 1 46 709 43 182 42 ... 0 0 0 0 0 55 1 3402 3 1341
2236 2 1 64014.0 2 1 56 406 0 30 0 ... 0 1 0 0 0 76 3 3040 5 444
2237 1 0 56981.0 0 0 91 908 48 217 32 ... 0 0 0 0 0 41 0 3176 1 1241
2238 2 1 69245.0 0 1 8 428 30 214 80 ... 0 0 0 0 0 66 1 3177 3 843
2239 2 1 52869.0 1 1 40 84 3 61 2 ... 0 0 0 0 1 68 2 3643 4 172

5 rows × 29 columns

Total Purchases¶

  • Let's create a new feature called "NumTotalPurchases", indicating the total number of products purchased by the customers.
In [176]:
# Create a new feature
# Add the number of purchases from each channel 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
data["NumTotalPurchases"] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']
In [177]:
data.describe()
Out[177]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases
count 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 ... 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000 2229.000000
mean 1.447286 0.644235 51628.825945 0.444594 0.507402 49.106326 304.991476 26.348587 165.283984 37.642441 ... 0.064603 0.013010 0.008973 0.149843 53.111261 0.951996 3374.777479 2.596231 605.541050 14.869000
std 0.543881 0.478852 20601.302353 0.538636 0.544735 28.946476 336.761943 39.764060 219.336589 54.700548 ... 0.245879 0.113344 0.094319 0.356998 11.707430 0.751976 202.248700 0.907432 601.032228 7.622187
min 0.000000 0.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 26.000000 0.000000 3021.000000 1.000000 5.000000 0.000000
25% 1.000000 0.000000 35416.000000 0.000000 0.000000 24.000000 24.000000 2.000000 16.000000 3.000000 ... 0.000000 0.000000 0.000000 0.000000 45.000000 0.000000 3202.000000 2.000000 69.000000 8.000000
50% 1.000000 1.000000 51563.000000 0.000000 0.000000 49.000000 176.000000 8.000000 67.000000 12.000000 ... 0.000000 0.000000 0.000000 0.000000 52.000000 1.000000 3377.000000 3.000000 397.000000 15.000000
75% 2.000000 1.000000 68118.000000 1.000000 1.000000 74.000000 505.000000 33.000000 231.000000 50.000000 ... 0.000000 0.000000 0.000000 0.000000 63.000000 1.000000 3550.000000 3.000000 1044.000000 21.000000
max 2.000000 1.000000 113734.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 ... 1.000000 1.000000 1.000000 1.000000 82.000000 3.000000 3720.000000 5.000000 2525.000000 43.000000

8 rows × 30 columns

TotalAcceptedCmp¶

  • Let's create a new feature called "TotalAcceptedCmp" that shows how many offers customers have accepted.
In [178]:
# Add all the campaign related variables to get the total number of accepted campaigns by a customer
# "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]
In [179]:
data.head(100)
Out[179]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp
0 1 0 58138.0 0 0 58 635 88 546 172 ... 0 0 1 65 0 3684 1 1617 25 1
1 1 0 46344.0 1 1 38 11 1 6 2 ... 0 0 0 68 2 3134 3 27 6 0
2 1 1 71613.0 0 0 26 426 49 127 111 ... 0 0 0 57 0 3333 2 776 21 0
3 1 1 26646.0 1 0 26 11 4 20 10 ... 0 0 0 38 1 3160 3 53 8 0
4 2 1 58293.0 1 0 94 173 43 118 46 ... 0 0 0 41 1 3182 3 422 19 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 1 1 30096.0 1 0 30 5 3 11 12 ... 0 0 0 39 1 3059 3 45 6 0
96 1 0 47916.0 0 1 72 505 0 26 0 ... 0 0 0 62 1 3605 2 606 22 1
97 1 0 51813.0 1 1 37 51 2 7 0 ... 0 0 0 50 2 3465 3 63 7 0
98 2 1 78497.0 0 0 44 207 26 447 75 ... 0 0 0 71 0 3231 2 978 25 1
99 2 1 50150.0 0 0 32 135 46 92 65 ... 0 0 0 39 0 3395 2 410 16 0

100 rows × 31 columns

AmountPerPurchase¶

  • Let's create a new feature called "AmountPerPurchase" indicating the amount spent per purchase.
In [ ]:
 
In [180]:
# Divide the "Expebnses" by "NumTotalPurchases" to create the new feature AmountPerPurchase 

# Let's get the observations equal 0
index = data.index
ids = index[data.NumTotalPurchases == 0]
print(ids)
Int64Index([981, 1524], dtype='int64')
In [181]:
data['Expenses'].describe()
Out[181]:
count    2229.000000
mean      605.541050
std       601.032228
min         5.000000
25%        69.000000
50%       397.000000
75%      1044.000000
max      2525.000000
Name: Expenses, dtype: float64
In [182]:
data.drop(ids, inplace=True) 


data['AmountPerPurchase'] = data['Expenses'] / data["NumTotalPurchases"]
In [183]:
data.describe().T
Out[183]:
count mean std min 25% 50% 75% max
Education 2227.0 1.447687 0.543960 0.000000 1.000000 1.000000 2.000000 2.0
Marital_Status 2227.0 0.644814 0.478677 0.000000 0.000000 1.000000 1.000000 1.0
Income 2227.0 51671.436925 20561.373156 1730.000000 35482.000000 51569.000000 68118.000000 113734.0
Kidhome 2227.0 0.444544 0.538667 0.000000 0.000000 0.000000 1.000000 2.0
Teenhome 2227.0 0.507858 0.544767 0.000000 0.000000 0.000000 1.000000 2.0
Recency 2227.0 49.116300 28.952535 0.000000 24.000000 49.000000 74.000000 99.0
MntWines 2227.0 305.263583 336.790652 0.000000 24.000000 177.000000 505.000000 1493.0
MntFruits 2227.0 26.371352 39.774657 0.000000 2.000000 8.000000 33.000000 199.0
MntMeatProducts 2227.0 165.431522 219.379791 0.000000 16.000000 68.000000 231.500000 1725.0
MntFishProducts 2227.0 37.675797 54.713778 0.000000 3.000000 12.000000 50.000000 259.0
MntSweetProducts 2227.0 27.185900 41.349490 0.000000 1.000000 8.000000 34.000000 263.0
MntGoldProds 2227.0 44.151774 52.098612 0.000000 9.000000 25.000000 56.500000 362.0
NumDealsPurchases 2227.0 2.321958 1.894826 0.000000 1.000000 2.000000 3.000000 15.0
NumWebPurchases 2227.0 4.103727 2.773984 0.000000 2.000000 4.000000 6.000000 27.0
NumCatalogPurchases 2227.0 2.638976 2.796299 0.000000 0.000000 2.000000 4.000000 28.0
NumStorePurchases 2227.0 5.817692 3.238913 0.000000 3.000000 5.000000 8.000000 13.0
NumWebVisitsMonth 2227.0 5.326448 2.402050 0.000000 3.000000 6.000000 7.000000 20.0
AcceptedCmp3 2227.0 0.073193 0.260511 0.000000 0.000000 0.000000 0.000000 1.0
AcceptedCmp4 2227.0 0.074989 0.263432 0.000000 0.000000 0.000000 0.000000 1.0
AcceptedCmp5 2227.0 0.072744 0.259774 0.000000 0.000000 0.000000 0.000000 1.0
AcceptedCmp1 2227.0 0.064661 0.245982 0.000000 0.000000 0.000000 0.000000 1.0
AcceptedCmp2 2227.0 0.013022 0.113394 0.000000 0.000000 0.000000 0.000000 1.0
Complain 2227.0 0.008981 0.094361 0.000000 0.000000 0.000000 0.000000 1.0
Response 2227.0 0.149978 0.357130 0.000000 0.000000 0.000000 0.000000 1.0
Age 2227.0 53.111361 11.712074 26.000000 45.000000 52.000000 63.000000 82.0
kids 2227.0 0.952402 0.752043 0.000000 0.000000 1.000000 1.000000 3.0
ageing_customer 2227.0 3374.893579 202.197318 3021.000000 3202.000000 3377.000000 3550.000000 3720.0
Family_Size 2227.0 2.597216 0.907120 1.000000 2.000000 3.000000 3.000000 5.0
Expenses 2227.0 606.079928 601.032876 8.000000 69.000000 397.000000 1044.500000 2525.0
NumTotalPurchases 2227.0 14.882353 7.612563 1.000000 8.000000 15.000000 21.000000 43.0
TotalAcceptedCmp 2227.0 0.448586 0.890416 0.000000 0.000000 0.000000 1.000000 5.0
AmountPerPurchase 2227.0 33.274270 45.040897 0.533333 9.714286 23.352941 45.281773 1679.0

Now, let's check the maximum value of the AmountPerPurchase.

Think About It:

  • Is the maximum value in the above output valid? What could be the potential reason for such output?
  • How many such values are there? Can we drop such observations?
In [184]:
# Let's get the observations equal 0
index = data.index
ids = index[data.AmountPerPurchase > 46]
print(ids)
data.drop(ids, inplace=True) 
#ids = index[data.AmountPerPurchase == 0]
#print(ids)
#data.drop(ids, inplace=True) 
Int64Index([   0,   12,   15,   18,   22,   29,   34,   40,   45,   49,
            ...
            2193, 2194, 2206, 2211, 2213, 2217, 2221, 2228, 2235, 2237],
           dtype='int64', length=547)
In [185]:
#
data.describe()
Out[185]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
count 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 ... 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000
mean 1.439286 0.655357 44111.917724 0.573214 0.587500 48.441071 170.642857 15.061905 71.968452 20.723810 ... 0.010714 0.102381 52.744643 1.160714 3364.927381 2.816071 328.027976 13.267262 0.249405 18.617871
std 0.557480 0.475393 17049.412847 0.549511 0.540849 29.039459 211.093201 27.958251 98.910845 36.262361 ... 0.102984 0.303239 11.277100 0.687074 199.717819 0.845464 345.212527 7.767198 0.613769 11.924125
min 0.000000 0.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 26.000000 0.000000 3021.000000 1.000000 8.000000 1.000000 0.000000 0.533333
25% 1.000000 0.000000 31486.250000 0.000000 0.000000 24.000000 15.000000 1.000000 11.000000 2.000000 ... 0.000000 0.000000 44.750000 1.000000 3190.000000 2.000000 54.000000 7.000000 0.000000 8.000000
50% 1.000000 1.000000 43141.000000 1.000000 1.000000 48.000000 70.000000 4.000000 30.000000 7.000000 ... 0.000000 0.000000 51.000000 1.000000 3360.000000 3.000000 166.000000 11.000000 0.000000 16.118056
75% 2.000000 1.000000 56349.250000 1.000000 1.000000 74.000000 267.000000 15.000000 99.000000 20.000000 ... 0.000000 0.000000 62.000000 2.000000 3533.000000 3.000000 536.500000 19.000000 0.000000 27.616228
max 2.000000 1.000000 113734.000000 2.000000 2.000000 99.000000 1200.000000 199.000000 1725.000000 259.000000 ... 1.000000 1.000000 82.000000 3.000000 3720.000000 5.000000 1730.000000 43.000000 5.000000 45.960000

8 rows × 32 columns

Now, let's check the distribution of values in AmountPerPurchase column.

In [186]:
# Check the summary statistics of the AmountPerPurchase variable 

data['AmountPerPurchase'] = data['AmountPerPurchase'].astype('int64')
In [187]:
data.AmountPerPurchase.describe()
Out[187]:
count    1680.000000
mean       18.167857
std        11.919990
min         0.000000
25%         8.000000
50%        16.000000
75%        27.000000
max        45.000000
Name: AmountPerPurchase, dtype: float64
In [188]:
# Plot the histogram for the AmountPerPurchas variable

plt.figure(figsize=(20, 7))
sns.histplot(x=data['AmountPerPurchase'], data = data )
plt.show()

Imputing Missing Values¶

In [189]:
# Impute the missing values for the Income variable with the median
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 1 to 2239
Data columns (total 32 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Education            1680 non-null   int64  
 1   Marital_Status       1680 non-null   int64  
 2   Income               1680 non-null   float64
 3   Kidhome              1680 non-null   int64  
 4   Teenhome             1680 non-null   int64  
 5   Recency              1680 non-null   int64  
 6   MntWines             1680 non-null   int64  
 7   MntFruits            1680 non-null   int64  
 8   MntMeatProducts      1680 non-null   int64  
 9   MntFishProducts      1680 non-null   int64  
 10  MntSweetProducts     1680 non-null   int64  
 11  MntGoldProds         1680 non-null   int64  
 12  NumDealsPurchases    1680 non-null   int64  
 13  NumWebPurchases      1680 non-null   int64  
 14  NumCatalogPurchases  1680 non-null   int64  
 15  NumStorePurchases    1680 non-null   int64  
 16  NumWebVisitsMonth    1680 non-null   int64  
 17  AcceptedCmp3         1680 non-null   int64  
 18  AcceptedCmp4         1680 non-null   int64  
 19  AcceptedCmp5         1680 non-null   int64  
 20  AcceptedCmp1         1680 non-null   int64  
 21  AcceptedCmp2         1680 non-null   int64  
 22  Complain             1680 non-null   int64  
 23  Response             1680 non-null   int64  
 24  Age                  1680 non-null   int64  
 25  kids                 1680 non-null   int64  
 26  ageing_customer      1680 non-null   int64  
 27  Family_Size          1680 non-null   int64  
 28  Expenses             1680 non-null   int64  
 29  NumTotalPurchases    1680 non-null   int64  
 30  TotalAcceptedCmp     1680 non-null   int64  
 31  AmountPerPurchase    1680 non-null   int64  
dtypes: float64(1), int64(31)
memory usage: 497.7 KB

Now that we are done with data preprocessing, let's visualize new features against the new income variable we have after imputing missing values.

Income Vs Expenses¶

In [190]:
data.describe()
Out[190]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
count 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 ... 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000
mean 1.439286 0.655357 44111.917724 0.573214 0.587500 48.441071 170.642857 15.061905 71.968452 20.723810 ... 0.010714 0.102381 52.744643 1.160714 3364.927381 2.816071 328.027976 13.267262 0.249405 18.167857
std 0.557480 0.475393 17049.412847 0.549511 0.540849 29.039459 211.093201 27.958251 98.910845 36.262361 ... 0.102984 0.303239 11.277100 0.687074 199.717819 0.845464 345.212527 7.767198 0.613769 11.919990
min 0.000000 0.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 26.000000 0.000000 3021.000000 1.000000 8.000000 1.000000 0.000000 0.000000
25% 1.000000 0.000000 31486.250000 0.000000 0.000000 24.000000 15.000000 1.000000 11.000000 2.000000 ... 0.000000 0.000000 44.750000 1.000000 3190.000000 2.000000 54.000000 7.000000 0.000000 8.000000
50% 1.000000 1.000000 43141.000000 1.000000 1.000000 48.000000 70.000000 4.000000 30.000000 7.000000 ... 0.000000 0.000000 51.000000 1.000000 3360.000000 3.000000 166.000000 11.000000 0.000000 16.000000
75% 2.000000 1.000000 56349.250000 1.000000 1.000000 74.000000 267.000000 15.000000 99.000000 20.000000 ... 0.000000 0.000000 62.000000 2.000000 3533.000000 3.000000 536.500000 19.000000 0.000000 27.000000
max 2.000000 1.000000 113734.000000 2.000000 2.000000 99.000000 1200.000000 199.000000 1725.000000 259.000000 ... 1.000000 1.000000 82.000000 3.000000 3720.000000 5.000000 1730.000000 43.000000 5.000000 45.000000

8 rows × 32 columns

In [191]:
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis  

plt.figure(figsize=(20, 10))                                    # Setting the plot size

sns.scatterplot(x = 'Income', y = 'Expenses', data = data)                                        # Hint: Use sns.scatterplot()  

plt.xticks(fontsize=16)                                         # Font size of X-label

plt.yticks(fontsize=16)                                         # Font size of Y-label

plt.xlabel("Income", fontsize=20, labelpad=20)                  # Title of X-axis

plt.ylabel("Expenses", fontsize=20, labelpad=20)                # Title of Y-axis
Out[191]:
Text(0, 0.5, 'Expenses')
In [ ]:
 

Observations and Insights: More income more expense ¶

Family Size Vs Income¶

In [192]:
# Plot the bar plot for Family Size on X-axis and Income on Y-axis

print(sns.barplot(x=data['Family_Size'], y=data['Income'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
In [193]:
print(sns.barplot(x=data['Family_Size'], y=data['Expenses'], data=data))
AxesSubplot(0.125,0.125;0.775x0.755)

Observations and Insights: The tendence is so similar¶

In [194]:
data.describe()
Out[194]:
Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts ... Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
count 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 ... 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000 1680.000000
mean 1.439286 0.655357 44111.917724 0.573214 0.587500 48.441071 170.642857 15.061905 71.968452 20.723810 ... 0.010714 0.102381 52.744643 1.160714 3364.927381 2.816071 328.027976 13.267262 0.249405 18.167857
std 0.557480 0.475393 17049.412847 0.549511 0.540849 29.039459 211.093201 27.958251 98.910845 36.262361 ... 0.102984 0.303239 11.277100 0.687074 199.717819 0.845464 345.212527 7.767198 0.613769 11.919990
min 0.000000 0.000000 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 26.000000 0.000000 3021.000000 1.000000 8.000000 1.000000 0.000000 0.000000
25% 1.000000 0.000000 31486.250000 0.000000 0.000000 24.000000 15.000000 1.000000 11.000000 2.000000 ... 0.000000 0.000000 44.750000 1.000000 3190.000000 2.000000 54.000000 7.000000 0.000000 8.000000
50% 1.000000 1.000000 43141.000000 1.000000 1.000000 48.000000 70.000000 4.000000 30.000000 7.000000 ... 0.000000 0.000000 51.000000 1.000000 3360.000000 3.000000 166.000000 11.000000 0.000000 16.000000
75% 2.000000 1.000000 56349.250000 1.000000 1.000000 74.000000 267.000000 15.000000 99.000000 20.000000 ... 0.000000 0.000000 62.000000 2.000000 3533.000000 3.000000 536.500000 19.000000 0.000000 27.000000
max 2.000000 1.000000 113734.000000 2.000000 2.000000 99.000000 1200.000000 199.000000 1725.000000 259.000000 ... 1.000000 1.000000 82.000000 3.000000 3720.000000 5.000000 1730.000000 43.000000 5.000000 45.000000

8 rows × 32 columns

In [196]:
data.to_csv('F:/DataScienceAI/capstone project/data_transformed.csv')
data_model = data.copy()
In [197]:
# Dropping all the irrelevant columns and storing in data_model
data_model = data_model.drop(
    columns=[
        "Complain",
        "Response",
        "AcceptedCmp1",
        "AcceptedCmp2",
        "AcceptedCmp3",
        "AcceptedCmp4",
        "AcceptedCmp5",
        "Marital_Status",
        "MntWines",             
        "MntFruits",           
        "MntMeatProducts",
        "MntFishProducts",
        "MntSweetProducts",
        "MntGoldProds",
        'Education',
        'Age', 
        "kids",
        "NumDealsPurchases", 
        "NumWebPurchases",
        "NumCatalogPurchases",
        "NumStorePurchases",
        "NumWebVisitsMonth",
        'Family_Size',
        'Kidhome',
        'Teenhome'
        
    ],
    axis=1,
)
In [198]:
# Check the shape of new data 
data_shape = data_model.shape
print(data_shape)
(1680, 7)
In [199]:
data_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1680 entries, 1 to 2239
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Income             1680 non-null   float64
 1   Recency            1680 non-null   int64  
 2   ageing_customer    1680 non-null   int64  
 3   Expenses           1680 non-null   int64  
 4   NumTotalPurchases  1680 non-null   int64  
 5   TotalAcceptedCmp   1680 non-null   int64  
 6   AmountPerPurchase  1680 non-null   int64  
dtypes: float64(1), int64(6)
memory usage: 169.5 KB
In [200]:
data_model.head()
Out[200]:
Income Recency ageing_customer Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
1 46344.0 38 3134 27 6 0 4
2 71613.0 26 3333 776 21 0 36
3 26646.0 26 3160 53 8 0 6
4 58293.0 94 3182 422 19 0 22
5 62513.0 16 3314 716 22 0 32
In [201]:
corr_matrix =data_model.iloc[ : ,0 : 8].corr()
corr_matrix
plt.figure(figsize=(15, 7))                                                        # Setting the plot size
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # Plotting the correlation plot
plt.show()

Observations

NumTotalPurchases and Expenses Expenses and AmountPerPurchase NumTotalPurchases and AmountPerPurchase There is no negative correlations.

K-Means¶

In [202]:
# Applying standard scaler on new data
# Initialize the Standard Scaler
scaler = StandardScaler()                                                   

# fit_transform the scaler function on new data
df_scaled = scaler.fit_transform(data_model)                                        

# Converting the embeddings to a dataframe
df_scaled = pd.DataFrame(df_scaled, columns=data_model.columns)      

df_scaled.head()
Out[202]:
Income Recency ageing_customer Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
0 0.130957 -0.359655 -1.156613 -0.872267 -0.935914 -0.406471 -1.188934
1 1.613503 -0.773009 -0.159910 1.298057 0.995860 -0.406471 1.496432
2 -1.024734 -0.773009 -1.026390 -0.796929 -0.678344 -0.406471 -1.021098
3 0.832011 1.569330 -0.916202 0.272296 0.738290 -0.406471 0.321584
4 1.079601 -1.117470 -0.255073 1.124199 1.124645 -0.406471 1.160761
In [203]:
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed
# Initializing T-SNE with number of component equal to 2, random_state=1, and perplexity=35
tsne = TSNE(n_components = 2, random_state =1, perplexity=35)        


data_air_pol_tsne = tsne.fit_transform(df_scaled)              # fit_transform T-SNE on new data

data_air_pol_tsne = pd.DataFrame(data_air_pol_tsne, columns=[0, 1])           # Converting the embeddings to a dataframe

plt.figure(figsize=(7, 7))                                                    # Scatter plot for two components

sns.scatterplot(x=0, y=1, data=data_air_pol_tsne)                             # Plotting T-SNE
Out[203]:
<AxesSubplot:xlabel='0', ylabel='1'>
In [204]:
# Defining the number of principal components to generate
n = data_model.shape[1]                                        # Storing the number of variables in the data

pca = PCA (n_components = n, random_state = 1)                                        # Initialize PCA with n_components = n and random_state=1

data_pca = pd.DataFrame(pca.fit_transform(df_scaled))                      # fit_transform PCA on the scaled data

# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_       
In [205]:
sns.scatterplot(x=0, y=1, data=data_pca)                             
Out[205]:
<AxesSubplot:xlabel='0', ylabel='1'>
In [206]:
distortions = []                                                  # Create an empty list

K = range(2, 10)                                                  # Setting the K range from 2 to 10

for k in K:
    kmeanModel = KMeans(n_clusters=k,random_state=1)              # Initialize K-Means
    kmeanModel.fit(data_pca)                                      # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_)                       # Append distortion values to the empty list created above
In [207]:
# Plotting the elbow plot
plt.figure(figsize=(16, 8))                                            # Setting the plot size

plt.plot(K, distortions, "bx-")                                        # Plotting the K on X-axis and distortions on y-axis

plt.xlabel("k")                                                        # Title of x-axis

plt.ylabel("Distortion")                                               # Title of y-axis

plt.title("The Elbow Method showing the optimal k")                    # Title of the plot
plt.show()
In [208]:
data_pca.head()
Out[208]:
0 1 2 3 4 5 6
0 -1.619710 -0.366646 1.251773 0.007878 -0.488247 0.317618 0.188856
1 2.595637 -0.270793 0.799426 -0.810939 -0.368801 -0.251374 0.033878
2 -1.855042 0.102592 1.036322 -0.366075 0.564947 0.157294 0.024897
3 0.873202 -1.765879 0.094906 0.897002 -0.049336 0.369427 -0.221308
4 2.162642 0.017551 0.917874 -0.997418 0.064268 -0.007288 -0.041550
In [259]:
sil_score = {}                                                             # Creating empty list
cluster_list = range(3, 7)                                                 # Creating a range from 3 to 7
for k in cluster_list:
    
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans (n_clusters = k, random_state=1).fit(data_pca)
    
    # Fit and predict on the pca data
    preds = clusterer.predict(data_pca)
    
    # Calculate silhouette score - Hint: Use silhouette_score() function
    score =silhouette_score(data_pca, preds)
    
    # Append silhouette score to empty list created above
    sil_score[k] = score
    
    # Print the silhouette scorea
    print( "For n_clusters = {}, the silhouette score is {})".format(k, score))  
    
For n_clusters = 3, the silhouette score is 0.334493184663598)
For n_clusters = 4, the silhouette score is 0.3432079436513443)
For n_clusters = 5, the silhouette score is 0.2835368980665695)
For n_clusters = 6, the silhouette score is 0.23965552654173458)
In [260]:
# Initialize the K-Means algorithm with 3 clusters and random_state=1
kmeans = KMeans (n_clusters=3 , random_state =1)                                

kmeans.fit(data_pca)       
Out[260]:
KMeans(n_clusters=3, random_state=1)
In [261]:
data_pca["K_means_segments_3"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

data ["K_means_segments_3"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_3"] = kmeans.labels_                  # Adding K-Means cluster labels to data_model
In [262]:
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
Out[262]:
0    631
1    562
2    487
Name: K_means_segments_3, dtype: int64
In [263]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=1, data=PCA, hue=cluster)
In [264]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")
In [265]:
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean

cluster_profile_Kmeans_3 = data.groupby("K_means_segments_3").mean()
In [266]:
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_Kmeans_3.style.highlight_max(color="lightgreen", axis=0)
Out[266]:
  Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
K_means_segments_3                                                                
0 1.483360 0.646593 36835.120461 0.787639 0.513471 50.507132 40.749604 4.004754 19.671949 5.679873 4.163233 12.090333 1.923930 2.007924 0.481775 3.270998 5.895404 0.055468 0.011094 0.000000 0.000000 0.003170 0.007924 0.033281 51.448494 1.301109 3205.152139 2.947702 86.359746 7.684628 0.103011 9.394612
1 1.503559 0.670819 61458.895922 0.213523 0.781139 47.450178 404.943060 34.010676 161.761566 45.537367 35.416370 66.471530 3.279359 6.617438 3.907473 8.688612 5.012456 0.072954 0.103203 0.030249 0.048043 0.014235 0.010676 0.115658 55.891459 0.994662 3384.370107 2.665480 748.140569 22.492883 0.384342 32.247331
2 1.308008 0.648871 33521.891698 0.710472 0.459959 46.907598 68.560575 7.521561 36.106776 11.581109 7.406571 25.166324 2.616016 2.878850 0.804928 3.554415 7.143737 0.086242 0.018480 0.000000 0.002053 0.000000 0.014374 0.176591 50.792608 1.170431 3549.509240 2.819302 156.342916 9.854209 0.283368 13.287474
In [217]:
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','ageing_customer','TotalAcceptedCmp','AmountPerPurchase']
In [218]:
# Creating boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (50, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_3'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [267]:
data_pca.drop(['K_means_segments_3'], axis =1, inplace = True)
data.drop(['K_means_segments_3'], axis =1, inplace = True)
data_model.drop(['K_means_segments_3'], axis =1, inplace = True)
In [268]:
# Fit the K-Means algorithm using number of cluster as 5 and random_state=0 on data_pca

# Initialize the K-Means algorithm with 3 clusters and random_state=1
kmeans = KMeans (n_clusters=5 , random_state = 0)                                
# Fitting on the data_pca
kmeans.fit(data_pca)                                     
Out[268]:
KMeans(n_clusters=5, random_state=0)
In [269]:
data_pca["K_means_segments_5"] = kmeans.labels_                    # Adding K-Means cluster labels to the data_pca data

data ["K_means_segments_5"] = kmeans.labels_                        # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_5"] = kmeans.labels_  
In [270]:
# Let's check the distribution
data_model["K_means_segments_5"].value_counts()
Out[270]:
1    479
2    410
3    399
0    336
4     56
Name: K_means_segments_5, dtype: int64
In [271]:
# Hint: Use PCA_PLOT function created above
#def PCA_PLOT(X, Y, PCA, cluster):
#    sns.scatterplot(x=X, y=1, data=PCA, hue=cluster)
In [272]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")
In [273]:
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean

cluster_profile_Kmeans_5 = data.groupby("K_means_segments_5").mean()
In [274]:
# Highlight the maximum average value among all the clusters for each of the variables
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_Kmeans_5.style.highlight_max(color="lightgreen", axis=0)
Out[274]:
  Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
K_means_segments_5                                                                
0 1.464286 0.669643 64583.300611 0.178571 0.761905 47.589286 457.211310 43.863095 200.449405 57.047619 45.455357 77.324405 3.151786 6.955357 4.562500 9.827381 4.613095 0.041667 0.050595 0.011905 0.020833 0.000000 0.011905 0.041667 56.157738 0.940476 3390.607143 2.610119 881.351190 24.497024 0.166667 35.526786
1 1.471816 0.647182 34846.332478 0.841336 0.482255 45.492693 25.628392 3.196242 14.486430 4.559499 3.225470 10.252610 1.837161 1.784969 0.350731 2.981211 5.981211 0.070981 0.004175 0.000000 0.000000 0.004175 0.010438 0.050104 50.594990 1.323591 3178.455115 2.970772 61.348643 6.954071 0.129436 7.876827
2 1.304878 0.634146 29730.925017 0.812195 0.382927 52.619512 28.019512 5.231707 20.856098 7.270732 5.446341 15.548780 2.100000 1.960976 0.451220 2.992683 7.241463 0.058537 0.009756 0.000000 0.000000 0.000000 0.014634 0.134146 49.126829 1.195122 3532.712195 2.829268 82.373171 7.504878 0.202439 9.697561
3 1.491228 0.696742 50530.714929 0.378446 0.781955 49.724311 211.418546 14.598997 75.842105 21.912281 14.195489 41.355890 3.355890 5.000000 1.839599 5.839599 5.704261 0.045113 0.062657 0.000000 0.010025 0.000000 0.005013 0.070175 55.952381 1.160401 3382.553885 2.857143 379.323308 16.035088 0.187970 22.902256
4 1.625000 0.500000 60092.946429 0.285714 0.553571 39.035714 445.303571 19.017857 139.375000 31.071429 25.410714 62.035714 3.428571 7.089286 4.857143 6.982143 6.357143 0.500000 0.464286 0.232143 0.303571 0.142857 0.017857 0.910714 54.285714 0.839286 3451.839286 2.339286 722.214286 22.357143 2.553571 30.321429
In [227]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (10, 10))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_5'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [275]:
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_5'], axis =1, inplace = True)
data.drop(['K_means_segments_5'], axis =1, inplace = True)
data_model.drop(['K_means_segments_5'], axis =1, inplace = True)

K-Medoids¶

In [276]:
# Initializing K-Medoids with number of clusters as 5 and random_state=1
kmedo = KMedoids (n_clusters = 3, random_state = 1)           

# Fit and predict K-Medoids using data_pca
kmedo.fit(data_pca)
preds = kmedo.predict(data_pca)             

# Calculate the silhouette score

score = silhouette_score(data_pca, preds)             

print(score)                   # Print the score
0.1988922758795572
In [277]:
# Predicting on data_pca and ddding K-Medoids cluster labels to the whole data
data['kmedoLabels'] = preds

# Predicting on data_pca and ddding K-Medoids cluster labels to data_model
data_model['kmedoLabels'] = preds

# Predicting on data_pca and ddding K-Medoids cluster labels to data_pca
data_pca['kmedoLabels'] = preds
In [278]:
# Let's check the distribution
data["kmedoLabels"].value_counts()
Out[278]:
2    662
1    627
0    391
Name: kmedoLabels, dtype: int64
In [279]:
PCA_PLOT(0, 1, data_pca, "kmedoLabels")
In [280]:
cluster_profile_kmedoLabels = data.groupby("kmedoLabels").mean()
cluster_profile_kmedoLabels.style.highlight_max(color="lightgreen", axis=0)
Out[280]:
  Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age kids ageing_customer Family_Size Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
kmedoLabels                                                                
0 1.501279 0.667519 39639.858720 0.731458 0.562660 74.212276 59.690537 5.360614 29.703325 7.437340 5.808184 17.020460 2.166240 2.452685 0.713555 3.593350 5.831202 0.061381 0.028133 0.000000 0.005115 0.000000 0.000000 0.020460 52.554987 1.294118 3234.598465 2.961637 125.020460 8.925831 0.115090 11.941176
1 1.349282 0.642743 31165.359261 0.811802 0.414673 35.393939 29.599681 4.443381 17.784689 6.047847 4.287081 13.181818 2.019139 1.923445 0.414673 2.990431 6.791069 0.078150 0.004785 0.000000 0.000000 0.003190 0.017544 0.133971 49.639553 1.226475 3400.448166 2.869219 75.344498 7.347687 0.220096 8.829346
2 1.487915 0.660121 59015.342537 0.253776 0.765861 45.577039 369.761329 30.848943 148.250755 42.471299 31.992447 63.930514 3.350453 6.379154 3.554381 8.154079 5.253776 0.067976 0.090634 0.025680 0.039275 0.012085 0.010574 0.120846 55.797583 1.019637 3408.261329 2.679758 687.255287 21.438066 0.356495 30.690332
In [281]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (10, 10))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['kmedoLabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [283]:
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True


data_pca.drop(['kmedoLabels'], axis =1, inplace = True)
data.drop(['kmedoLabels'], axis =1, inplace = True)
data_model.drop(['kmedoLabels'], axis =1, inplace = True)

Hierarchical Clustering¶

Let's find the Cophenetic correlation for different distances with different linkage methods.

What is a Cophenetic correlation?¶

The cophenetic correlation coefficient is a correlation coefficient between the cophenetic distances(Dendrogramic distance) obtained from the tree, and the original distances used to construct the tree. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points.

The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.

Cophenetic correlation is the way to compare two or more dendrograms.

Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods

In [236]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average"]

high_cophenet_corr = 0                                                 # Creating a variable by assigning 0 to it
high_dm_lm = [0, 0]                                                    # Creating a list by assigning 0's to it

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm)                    # Applying different linkages with different distance on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca))                   # Calculating cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:                                     # Checking if cophenetic correlation is higher than previous score
            high_cophenet_corr = c                                     # Appending to high_cophenet_corr list if it is higher
            high_dm_lm[0] = dm                                         # Appending its corresponding distance
            high_dm_lm[1] = lm                                         # Appending its corresponding method or linkage
Cophenetic correlation for Euclidean distance and single linkage is 0.4519580951404056.
Cophenetic correlation for Euclidean distance and complete linkage is 0.6618285441876266.
Cophenetic correlation for Euclidean distance and average linkage is 0.7456986470939312.
Cophenetic correlation for Chebyshev distance and single linkage is 0.461008202489796.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6471878469387611.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7184644061096592.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.5049187196205653.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.4893803759051994.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.6175117901276773.
Cophenetic correlation for Cityblock distance and single linkage is 0.5318044327779199.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7105065386093503.
Cophenetic correlation for Cityblock distance and average linkage is 0.7254498869142997.
In [237]:
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.7456986470939312, which is obtained with Euclidean distance and average linkage.

Let's have a look at the dendrograms for different linkages with Cityblock distance

In [238]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Cityblock", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights:

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?

Let's have a look at the dendrograms for different linkages with Chebyshev distance

In [417]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Chebyshev", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Let's have a look at the dendrograms for different linkages with Mahalanobis distance

In [418]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Mahalanobis", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
In [419]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="Euclidean", method=method)                  # Measures the distances between two clusters

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
In [420]:
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean, linkage as 'Ward' with clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward',) 

# Fit on data_pca
HCmodel.fit(data_pca)    
Out[420]:
AgglomerativeClustering(n_clusters=3)
In [421]:
# Add Agglomerative Clustering cluster labels to data_pca
data_pca['HCLabels'] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to the whole data
data['HCLabels'] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to data_model
data_model['HCLabels'] = HCmodel.labels_
In [422]:
# Let's check the distribution
data_pca.HCLabels.value_counts()
Out[422]:
1    786
0    463
2    431
Name: HCLabels, dtype: int64
In [423]:
PCA_PLOT(0, 1, data_pca, "HCLabels")
In [424]:
cluster_profile_HCLabels_3 = data.groupby("HCLabels").mean()
cluster_profile_HCLabels_3.style.highlight_max(color="lightgreen", axis=0)
Out[424]:
  Recency ageing_customer Expenses NumTotalPurchases TotalAcceptedCmp AmountPerPurchase
HCLabels            
0 47.278618 3404.034557 812.701944 23.548596 0.401728 33.764579
1 59.708651 3279.685751 125.041985 8.863868 0.045802 11.178117
2 29.141531 3474.470998 177.547564 10.252900 0.457077 14.160093
In [425]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (10, 10))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['HCLabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [426]:
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['HCLabels'], axis =1, inplace = True)
data.drop(['HCLabels'], axis =1, inplace = True)
data_model.drop(['HCLabels'], axis =1, inplace = True)

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It includes two hyperparameters, eps, and min samples.

Since it is an unsupervised algorithm, you have no control over it, unlike a supervised learning algorithm, which allows you to test your algorithm on a validation set. The approach we can follow is basically trying out a bunch of different combinations of values and finding the silhouette score for each of them.

In [284]:
# Initializing lists
eps_value = [2,3]                       # Taking random eps value
min_sample_values = [6,20]              # Taking random min_sample value

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  
In [285]:
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0]                                                     # Assigning 0's to the high_i_j list
key = res.keys()                                                      # Assigning dictionary keys to a variable called key
for i in key:
    z = res[i]                                                        # Assigning dictionary values of each i to z
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)               # Applying DBSCAN to each of the combination in dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)           # Finding silhouette score 
        print( 
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhoutte_score is :",
            silhouette_avg,                                          # Printing the silhouette score for each of the combinations
        )
        if high_silhouette_avg < silhouette_avg:                     # If the silhouette score is greater than 0 or the previous score, it will get appended to the high_silhouette_avg list with its combination of i and j              
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhoutte_score is : 0.5050254980203432
For eps value =2 For min sample =20 The average silhoutte_score is : 0.47408529133829885
For eps value =3 For min sample =6 The average silhoutte_score is : 0.48036844597375694
For eps value =3 For min sample =20 The average silhoutte_score is : 0.5805508949885789
In [286]:
# Printing the highest silhouette score
print("Highest_silhoutte_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest_silhoutte_avg is 0 for eps = 3 and min sample = 20
In [291]:
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps=3, min_samples=20).fit(data_pca)  
In [292]:
# fit_predict on data_pca and add DBSCAN cluster labels to the whole data
data['DBSLabels'] = dbs.fit_predict(data_pca)
# fit_predict on data_pca and add DBSCAN cluster labels to data_model
data_model['DBSLabels'] = dbs.fit_predict(data_pca)
# fit_predict on data_pca and add DBSCAN cluster labels to data_pca
data_pca['DBSLabels'] = dbs.fit_predict(data_pca)
In [293]:
# Let's check the distribution
data_pca["DBSLabels"].value_counts()
Out[293]:
 0    1674
-1       6
Name: DBSLabels, dtype: int64
In [294]:
PCA_PLOT(0, 1, data_pca, "DBSLabels")
In [295]:
data.drop(['DBSLabels'], axis =1, inplace = True)
data_model.drop(['DBSLabels'], axis =1, inplace = True)
data_pca.drop(['DBSLabels'], axis =1, inplace = True)
In [296]:
gmm = GaussianMixture(n_components =5, random_state=1) # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1

# Fit and predict Gaussian Mixture Model using data_pca

gmm.fit(data_pca)

preds = gmm.predict(data_pca)            

score = score =silhouette_score(data_pca, preds)           # Calculate the silhouette score

print(score)                   # Print the score
0.1936138024032675
In [297]:
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to the whole data
data['GmmLabels'] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_model
data_model['GmmLabels'] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_pca
data_pca['GmmLabels'] = preds
In [298]:
# Let's check the distribution

data_model["GmmLabels"].value_counts()
Out[298]:
3    704
0    664
2    125
4    116
1     71
Name: GmmLabels, dtype: int64
In [299]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (10, 10))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['GmmLabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: